
How to extract a textual value

Extracting textual values is a common task in every domain, whatever your use case and whatever documents you want to process (invoices, utility bills, contracts, license agreements, payslips, or even a highly customized homemade document). There is always some static text nearby that identifies the value you need, even though the value itself can vary in format and in position on the page.

Anchored text

When you process invoices, the invoice number is always important. Its anchor, the static text that identifies it, can be INVOICE, Invoice #, Invoice No, and so on.

For documents such as contracts and license agreements, information about the counterparties can be extracted the same way. The anchors can be Licensee and Licensor, or Owner and Contractor.

It is important that these anchors can always be found in the document and that they stay the same across other documents of the same type.

There are two ways a value can be related to its anchor.

Anchor above

In this case, the anchor sits above the value. When you define the parsing rules, use the Paragraph selector with the Paragraph name parameter.
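To illustrate what selecting a paragraph by its name amounts to, here is a minimal sketch in plain Python. It is not the product's API: the function name paragraph_below and the sample text are hypothetical, and it assumes the anchor occupies a line of its own and that a blank line ends the paragraph.

```python
from typing import Optional


def paragraph_below(text: str, anchor: str) -> Optional[str]:
    """Return the paragraph that follows a line equal to `anchor`.

    Hypothetical helper: the comparison ignores case and surrounding
    whitespace, and a blank line is assumed to end the paragraph.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() != anchor.strip().lower():
            continue
        paragraph = []
        for following in lines[i + 1:]:
            if not following.strip():
                if paragraph:
                    break      # a blank line closes the paragraph
                continue       # skip blank lines between anchor and value
            paragraph.append(following.strip())
        return " ".join(paragraph) or None
    return None


sample = "Licensor\n\nAcme Corp.\n1 Main Street\n\nLicensee\nBeta LLC\n"
licensor = paragraph_below(sample, "Licensor")
licensee = paragraph_below(sample, "licensee")
```

The anchor line itself is never part of the result, so the same rule keeps working when the value underneath changes from one document to the next.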

Inline anchor

The Pattern Finder is the workhorse here. Because the anchor is located before the value, you can add the anchor text as a prefix.
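The idea of a prefix can be sketched with a regular expression. This is only an illustration in plain Python, not how the Pattern Finder is implemented; the function name value_after_prefix is hypothetical, and the sketch assumes the value is the first run of non-space characters after the anchor.

```python
import re
from typing import Iterable, Optional


def value_after_prefix(text: str, prefixes: Iterable[str]) -> Optional[str]:
    """Return the first token that follows any of the given anchor texts."""
    # Try longer anchors first so "Invoice No" wins over plain "INVOICE".
    ordered = sorted(prefixes, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # Allow an optional colon or period and whitespace between anchor and value.
    pattern = re.compile(r"(?:" + alternatives + r")\s*[:.]?\s*(\S+)", re.IGNORECASE)
    match = pattern.search(text)
    return match.group(1) if match else None


anchors = ["INVOICE", "Invoice #", "Invoice No"]
number_a = value_after_prefix("Invoice No: INV-2041 Date: 2024-01-05", anchors)
number_b = value_after_prefix("Invoice # 12345", anchors)
```

Listing every known spelling of the anchor lets one rule cover invoices from different senders.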

Moreover, if your value follows a typical format, you can add extra validation using the Type parameter. For instance, when extracting the total amount from an invoice, it is worth specifying the type as Price.
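The effect of such a type check can be sketched as follows. This is a hedged illustration in plain Python, not the product's Price type: the pattern PRICE_PATTERN and the function price_after_prefix are hypothetical, and they assume amounts written with an optional currency symbol, comma thousands separators, and a period as the decimal mark.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed price shape: optional currency symbol, digits with optional
# comma-separated thousands, and up to two decimal places.
PRICE_PATTERN = r"[$€£]?\s?(\d+(?:,\d{3})*(?:\.\d{1,2})?)"


def price_after_prefix(text: str, prefix: str) -> Optional[Decimal]:
    """Return the amount after `prefix`, or None if no price follows it."""
    pattern = re.compile(
        r"\b" + re.escape(prefix) + r"\s*:?\s*" + PRICE_PATTERN,
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


total = price_after_prefix("Subtotal: $10.00 Total: $1,234.50", "Total")
missing = price_after_prefix("Total: see attached", "Total")
```

Because the value must look like a price, text that merely follows the anchor is rejected instead of being returned as a wrong total. The word boundary before the prefix keeps Total from matching inside Subtotal.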


Additionally, you can restrict the extracted value by the static text that follows it. This is especially useful when the Type is text, because free text has no format of its own to mark where the value ends.
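Bounding a value on both sides can be sketched like this. Again, this is plain Python for illustration only, not the product's implementation; the function name text_between and the sample sentence are hypothetical.

```python
import re
from typing import Optional


def text_between(text: str, prefix: str, suffix: str) -> Optional[str]:
    """Return the shortest text found between `prefix` and `suffix`."""
    pattern = re.compile(
        # The lazy group (.+?) stops at the first occurrence of the suffix.
        re.escape(prefix) + r"\s*:?\s*(.+?)\s*" + re.escape(suffix),
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


contract = "Licensor: Acme Corp. Licensee: Beta LLC Effective date: 1 May 2024"
licensor_name = text_between(contract, "Licensor", "Licensee")
licensee_name = text_between(contract, "Licensee", "Effective date")
```

Without the trailing static text, a free-text value such as a company name would have no clear end and could run on into the next field.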